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Sparse  models  in  dictionary  learning  have  been  successfully  applied  in  a  wide  variety  of  machine  learning  and 
computer  vision  problems,  and  have  also  recently  emerged  with  increasing  research  interest.  Another  interesting 
related  problem  based  on  linear  equality  constraint,  namely  the  sparse  null  space  problem  (SNS),  first  appeared  in 
1986  and  has  since  inspired  results  on  sparse  basis  pursuit. 

In  this  paper,  we  investigate  the  relation  between  the  SNS  problem  and  the  analysis  dictionary  learning  problem,  and 
show  that  the  SNS  problem  plays  a  central  role,  and  may  be  utilized  to  solve  dictionary  learning  problems.  Moreover, 
we  propose  an  efficient  algorithm  of  sparse  null  space  basis  pursuit,  and  extend  it  to  a  solution  of  analysis  dictionary 
learning.  Experimental  results  on  numerical  synthetic  data  and  real-world  data  are  further  presented  to  validate  the 
performance  of  our  method. 
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Abstract.  Sparse  models  in  dictionary  learning  have  been  successfully  applied  in  a  wide  variety  of  machine 
learning  and  computer  vision  problems,  and  have  also  recently  emerged  with  increasing  research  interest.  Another 
interesting  related  problem  based  on  linear  equality  constraint,  namely  the  sparse  null  space  problem  (SNS),  first 
appeared  in  1986  and  has  since  inspired  results  on  sparse  basis  pursuit. 

In  this  paper,  we  investigate  the  relation  between  the  SNS  problem  and  the  analysis  dictionary  learning  problem, 
and  show  that  the  SNS  problem  plays  a  central  role,  and  may  be  utilized  to  solve  dictionary  learning  problems. 
Moreover,  we  propose  an  efficient  algorithm  of  sparse  null  space  basis  pursuit,  and  extend  it  to  a  solution  of  analysis 
dictionary  learning.  Experimental  results  on  numerical  synthetic  data  and  real-world  data  are  further  presented  to 
validate  the  performance  of  our  method. 

Key  words.  Dictionary  learning,  sparse  coding,  sparse  null  space  problem 
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1.  Introduction.  High  dimensional  data  analysis  has  been  the  focus  of  research  in  diverse 
areas,  including  machine  learning,  computer  vision,  and  applied  mathematics,  on  account  of  its 
theoretical  complexity,  and  its  high  relevance  to  big  data  problems.  Dictionary  learning  has  been 
one  of  the  key  methodologies  in  addressing  high  dimensional  data,  and  has  successfully  been  applied 
in  feature  extraction  [12],  for  signal  denoising  [12]  [10]  [24],  pattern  recognition  and  classification  [20], 
etc. 

Specifically,  a  set  of  atoms  learned  from  a  given  dataset  are  considered  as  a  dictionary,  and  are 
expected  to  have  the  potential  to  analyze  unknown  incoming  data.  In  order  to  construct  an  effective 
dictionary,  signal  models  play  a  key  role.  One  common  assumption  is  that  high  dimensional  data 
is  concentrated  in  a  low-dimensional  manifold  embedded  in  a  high  dimensional  space.  Dimension 
reduction  was  hence  a  natural  way  to  characterize  the  data,  and  was  subsequently  extended  to  a 
larger  family  of  algorithms,  and  often  referred  to  as  nonlinear  dimension  reduction  [23]  [9].  With  a 
different  perspective  on  the  same  issue,  recent  research  has  shown  that  sparse  models  may  also  be 
very  useful  for  learning  discriminating  and  robust  dictionaries  from  data  (see  [10]  and  references 
within).  While  these  were  broadly  applicable,  a  particularly  well  adapted  structure  of  data,  so-called 
a  union  of  subspaces  (UoS),  was  revealing  of  the  power  of  such  approaches.  In  particular,  using  a 
parsimony  constraint  on  the  number  of  atoms  to  represent  the  data  at  hand,  helps  effectively  recover 
the  underlying  basis  of  each  subspace  [13]  [3].  In  dictionary  learning,  synthesis  and  analysis  models 
were  proposed,  and  their  respective  strengths  and  limitations  in  obtaing  sparse  representations  of 
data,  are  also  of  interest.  On  the  one  hand,  synthesis  models  seek  a  synthesis  dictionary  D  = 
[di, . . .  ,dn]  such  that  ll*S'||o  <  where  is  the  data  of  interest,  dj  is  the 

atom  in  the  dictionary,  and  Wij  the  associated  coefficient.  On  the  other  hand,  an  operator  H 
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is  sought  in  an  analysis  model,  so  that  o  yields  a  sparse  coefficient  vector  for  representing 
Xi  [2]  [24]. 

Another  interesting  and  seemingly  unrelated  problem,  invoking  sparsity  is  the  sparse  null  space 
problem  (SNS),  and  was  first  proposed  in  1986  by  Coleman  and  Pothen  [8].  As  we  elaborate  further 
in  this  paper,  the  SNS  problem  may  be  stated  as  searching  for  a  sparse  basis  for  the  null  space  of 
a  given  matrix  A.  We  demonstrate  that  the  sparse  null  space  solution  is  instrumental  in  helping 
understand  the  analysis  dictionary  learning  problem,  and  in  providing  sufficient  insight  for  achieving 
systematic  and  applicable  solutions. 

In  this  paper,  we  examine  the  relation  between  the  SNS  problem  and  the  dictionary  learning 
problem,  and  we  prove  that  the  SNS  problem  is  equivalent  to  the  analysis  dictionary  learning 
problem  (ADL).  We  then  proceed  to  solve  the  ADL  problem  using  methods  for  the  SNS  problem. 
Specifically,  inspired  by  the  existing  results  for  SNS  problem  and  the  state-of-the-art  of  sparsity 
pursuit  algorithms,  we  present  a  li  minimization-based  greedy  algorithm  to  solve  the  SNS  problem. 
In  contrast  to  current  mainstream  algorithms  [10]  [24]  [20]  [15],  the  convergence  of  our  method  is 
assured  by  both  the  convergence  of  the  greedy  algorithm  and  the  convex  li  minimization.  Moreover, 
we  demonstrate  its  superior  performance  on  both  synthetic  numerical  data  and  real-world  data. 

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  2,  we  discuss  the  current  state 
of  the  art  in  the  area  of  dictionary  learning,  specifically  in  ADL.  In  Section  3,  we  analyze  the 
relationship  of  the  SNS  problem  to  that  of  ADL,  and  show  their  equivalence.  In  Section  4,  we 
present  an  effective  method  to  solve  the  SNS  problem,  and  which  essentially  can  also  efficiently 
solve  the  ADL  problem.  Finally,  in  Section  5  we  validate  our  method  on  the  analysis  dictionary 
learning  problem  by  numerical  experiments,  and  illustrate  the  effectiveness  of  our  algorithm  on 
texture  classification. 

1.1.  Notation.  The  notational  conventions  used  throughout  this  paper  are  as  follows:  For  an 
m  X  n  matrix  X,  the  space  spanned  by  its  rows  is  denoted  as  row(X),  and  that  spanned  by  its 
columns  col(X).  Its  null  space  is  denoted  by  null(X).  The  sparsity  of  X,  defined  as  ,  is  denoted 
by  p(X).  Moreover,  we  denote  by  Px  the  projection  matrix  onto  col(X),  and  by  Px-  =I-ifPx 
the  projection  matrix  onto  null(X).  Additionally,  given  a  vector  y  G  operator  (•)j  returns  the 
value  of  the  element  of  y.  The  adjoint  operator  of  (•)j,  denoted  as  (•)*,  is  hence  as  follows: 
(c)*  =  V  G  such  that  {v)j  =  c  and  (v)^  =  0,  if  i  7^  j.  Finally,  we  denote  the  set  of  all  sparse 
vectors  {x|||x||o  <  k}  by 

2.  Related  work.  Much  of  the  reseach  in  dictionary  learning  has  primarily  focused  on  synthe¬ 
sis  models  such  as  [I]  [20]  [19]  [15],  among  many  others.  Remarkable  performance  has  been  achieved 
in  learning  effective  dictionaries  which  are  well  adapted  to  specific  datasets,  especially  on  imagery 
data.  The  contribution  of  the  atoms  in  the  representation  of  the  the  training  data  was  constrained 
to  be  sparse.  The  same  sparseness  is  exploited  to  recover  an  input  signal  from  corrupted  data  [I] 
or  to  classify  signals  into  different  clusters  [14]. 

Specifically,  the  learned  dictionary  and  the  corresponding  sparse  coefficients  are  alternately 
updated  untill  convergence  or  until  a  target  performance  attainment.  With  the  sparse  coefficients 
in  hand,  learning  each  atom  may  be  solved  by  the  gradient  descent  or  its  variations,  such  as 
the  stochastic  gradient  descent  [4].  With  the  dictionary  atoms  discovery,  different  approaches 
have  been  proposed  to  determine  the  sparse  coefficients.  In  K-SVD  [I],  for  example,  orthogonal 
matching  pursuit (OMP)  is  used  to  recover  the  sparse  coefficient,  while  Lasso-LARS  is  chosen  in 
online  dictionary  learning  for  sparse  coding  [20]. 

A  well  known  issue  with  synthesis  dictionary  learning  is  the  poor  stability  in  its  signal  repre- 
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sentation  [25]  [11].  This  is  primarily  due  to  the  difficulty  in  controlling  the  coherence  of  the  learned 
dictionary,  which  may,  in  turn,  lead  to  multiple  representations  of  the  same  signal  [5].  While  this 
phenomenon  does  not  particularly  adversely  affect  denoising  [11],  it  may  significantly  impact  a 
consistent  classification  or  clustering  performance.  In  addition,  the  computational  cost  to  process 
new  incoming  data  points,  often  calls  for  a  procedure  such  as  a  matching  pursuit  procedure  [1],  or 
a  sparse  coding  routine  such  as  Lasso  [20] ,  is  too  significant  for  purposed  of  large  high  dimensional 
data  sets. 

While  ADL  is  known  to  be  difficult  to  train,  it  is  free  of  the  above- noted  SDL  limitations  [25]. 
Additionally,  upon  learning  the  dictionary  D,  the  representation  of  any  data  x  is  unique  as  it  is 
the  result  of  one  matrix- vector  multiplication.  The  latter  fact  also  yields  a  low  computational  cost 
of  processing  new  data  which  is  both  linear  in  the  data  dimension  and  in  the  sample  size. 

Analysis  K-SVD  [25]  provides,  to  the  best  of  our  knowledge,  the  current  state-of-the-art  solution 
to  the  ADL  problem.  In  the  learning  procedure,  a  framework  similar  to  K-SVD  is  designed  to 
alternately  update  the  dictionary  and  the  sparse  coefficients.  At  each  iteration,  each  atom  is 
independently  updated  by  minimizing  the  covariance  between  the  atom  and  a  subset  of  data  samples 
that  are  “almost  orthogonal”  to  the  atom.  The  minimization  may  be  formulated  as  searching  for 
the  singular  vectors  corresponding  to  the  minimal  singular  value  of  the  data  matrix  of  the  subset 
of  samples.  Analysis  K-SVD  generally  achieves  the  best  known  performance  in  the  recovery  of  the 
original  data  space. 

As  a  departure  from  Analysis  K-SVD,  we  present  in  this  paper  an  alternative  ADL  algorithm 
which  is  closely  related  to  SNS  problem  discussed  next.  By  decomposing  the  DL  problem  into  a  set 
of  convex  optimization  subproblems,  we  can  guarantee  the  convergence  of  our  proposed  algorithm. 
Additionally,  and  in  contrast  to  Analysis  K-SVD,  our  proposed  method  can  naturally  adapt  in 
representing  data  of  varying  underlying  sparsity  with  no  prior  knowledge. 

3.  Prom  SNS  to  ADL.  In  this  section,  we  reformulate  the  SNS  problem  and  the  ADL 
problem  in  a  matrix  form,  and  then  establish  their  equivalence. 

Given  any  mxn  matrix  A  such  that  row  (A)  C  the  SNS  problem  may  be  defined  as  follows, 

(3.1)  SNS{A)  =  argrmn  ||N||o,  s.t.  col(N)  =  null(A). 

Let  X  =  [xi, . . . ,  Xn]  be  a  generic  data  matrix,  the  ADL  problem  can  then  be  rewritten  as 

(3.2)  ADL(X)  =  argrmn  ||U||o,  s.t.  DX  =  U,row(X)  =  row(U), 

where  D  is  an  analysis  operator  in  matrix  form,  and  U  is  the  associated  sparse  coefficient  matrix.  To 
avoid  a  trivial  solution  such  as  U  =  0,  we  further  require  row(X)  =  row(U).  Essentially,  this  is  the 
maximum  information  we  can  preserve  for  X,  since  all  rows  of  U  is  a  linear  combination  of  rows  in 
X,  and  hence  row(U)  C  row(X).  In  practice,  we  may  also  consider  the  case  that  row(U)  C  row(X) 
by  further  selecting  a  subset  of  di  in  D.  We  are  focusing  here  on  the  generic  formulation,  i.e. 
row(X)  =  row(U),  for  the  sake  of  theoretical  analysis,  and  will  elaborate  on  this  issue  later  in  the 
detailed  discussion  of  the  algorithm. 

We  note  that  finding  a  sparse  representation  of  the  null  space  in  Problem  3.1,  is  equivalent  to 
sparsifying  a  given  matrix  N  such  that  col(N)  =  null(A).  This  coincides  with  the  goal  of  Problem 
3.2  where  the  row  space  of  data  matrix  X  is  instead  invoked.  In  particular,  we  formally  have  the 
following  result. 

Theorem  3.1.  Assume  null(A)  =  row(X),  then  a  matrix'N  is  a  minimizer  of  the  SNS  problem 
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(as  shown  in  Eq.  (3.1)),  if  and  only  i/ is  a  minimizer  of  the  ADL  problem  (as  shown  in  Eq. 

(3.2) )  . 

Proof.  Assume  N  is  a  minimizer  of  (3.1),  then  the  constraints  in  (3.1)  ensure  that  null  (A)  = 
col(N).  Since  col(X^)  =  row(X)  =  null(A),  we  then  have 

(3.3)  col(X^)  =  col(N). 

We  next  consider  any  optimizer  U  of  problem  (3.2).  Since  row(X)  =  row(U),  when  combined  with 
the  condition  row(X)  =  null(A),  we  have  row(U)  =  row(X)  =  row(N^).  is  therefore  a  feasible 
solution  of  Eq.  (3.2),  and  so  is  of  Eq.  (3.1).  It  follows  that  ||U||o  =  ||N^||o,  and  hence  is 
also  a  minimizer  of  (3.2)  and  as  is  for  Eq.  (3.1).  □ 

This  essentially  tells  us  that  we  can  solve  a  sparse  dictionary  learning  problem,  should  we  have 
access  to  an  effective  method  of  solving  the  corresponding  SNS  problem.  Specifically,  given  a  data 
matrix  X  =  [xi, . . .  ,Xn],  the  analysis  dictionary  for  X  may  be  constructed  in  the  following  three 
steps: 

1.  Construct  a  matrix  A  such  that  row(A)  =  null(X),  i.e.  XA^  =  0  and  rank{A)  + 
rankifK)  =  n. 

2.  Eind  the  sparse  feature  vectors  by  solving  N  =  SNS  (A). 

3.  Construct  the  analysis  operator  D  from  DX  =  U. 

4.  An  iterative  sparse  null  space  pursuit.  We  have  discussed  the  relation  of  SNS  and 
ADL  in  Section  3,  and  have  shown  that  they  may  be  cast  in  one  unified  framework  of  sparse  null 
space  pursuit.  Nevertheless,  solving  SNS  is  itself  a  difficult  problem.  Coleman  and  Pothen  [8] 
have  proved  that  SNS  is  essentially  NP-hard,  hence  ruling  out  a  polynomial  time  algorithm.  We, 
however,  show  that  it  is  still  possible  to  approximate  the  sparse  null  space  basis  in  polynomial 
time.  In  this  section,  we  propose  an  /i-based  iterative  optimization  method  for  a  sparse  null  space 
pursuit. 

4.1.  A  greedy  algorithm  for  the  SNS  problem.  Previous  works  on  the  SNS  problem 
have  shed  some  light  towards  a  solution  in  polynomial  time.  In  [8],  the  authors  proposed  a  greedy 
algorithm  for  the  SNS  problem.  Eor  the  sake  of  clarity  and  further  discussion,  we  refer  to  it  as 
Algorithm  1.  Additionally,  it  has  been  proved  in  [8]  that  Algorithm  1  can  be  used  to  construct  a 
sparse  null  space  basis,  as  stated  in  Theorem  4.1  [8]. 


Algorithm  1  A  greedy  algorithm  for  sparse  null  space  problem 

Initialize:  matrix  A  G  N  =  0 

for  i  =  1,  ...  ,  q  do 

Eind  a  sparsest  null  vector  u^  such  that  rank(N  0  u^)  =  i. 

N  =  N  0  Ui 
eud  for 


Theorem  4.1.  A  matrix  N  a  sparsest  null  basis  of  A  if  and  only  if  it  can  be  constructed  by 
the  greedy  algorithm. 

It  is  worth  noting  that  the  maximum  number  of  iterations  q  in  Algorithm  1  is  constrained  by 
the  rank  of  A,  i.e.  q  =  d  —  rank{A).  Moreover,  this  greedy  algorithm  can  find  the  global  optimal 
solution  for  the  SNS  problem.  This  elegant  result  amounts  to  finding  the  sparsest  null  space  basis 
of  A  in  exactly  q  steps.  The  subproblem  of  finding  a  sparsest  null  vector  itself  is,  however,  also 
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a  NP-hard  problem  [8].  We  therefore  next  focus  on  finding  a  method  to  solve  this  subproblem  in 
each  iteration  of  Algorithm  1. 

4.2.  /i-based  search  for  sparse  null  space.  We  first  reformulate  the  subproblem  of  finding 
a  sparsest  null  vector  in  Algorithm  1  as  follows, 

min  ||ni||o, 

(4.1)  s.t.  An^  =  0,PNj__^ni  7^  0, 

where  is  the  subspace  spanned  by  the  constructed  null  space  vectors  in  the  previous  (i  —  1)^^ 

iteration.  The  condition  P^±  7^  0  implies  that  is  not  in  the  current  span  of  N,  and  hence 

rankCNi  0  n^)  =  rankCNi)  0  1. 

There  are  two  inherent  difficulties  in  this  formulation.  First,  ||  •  ||o  is  of  combinatorial  nature, 
hence  the  reason  of  the  NP-hardness  of  the  problem.  Second,  the  constraint  in  Eq.  (4.1) 

(4.2)  7^0, 

% 

defines  a  region  that  is  neither  compact  nor  convex.  To  address  the  first  problem,  we  propose  to 
take  advantage  of  established  results  on  sparsity  pursuit  via  h  minimization  [6]  [7].  To  address  the 
second  one,  and  to  hence  obtain  a  convex  and  compact  feasible  region,  we  propose  to  restate  the 
condition  Pj^^n^  7^  0  as  follows, 

(4.3)  3j  e  {1, . . . ,  d},  =  c, 

where  c  is  a  positive  constant. 

Additionally,  we  establish  the  following  lemma  to  justify  the  variation  on  the  constraint  from 
(4.2)  to  (4.3): 

Lemma  4.2.  The  solution  of  Problem  4-1  remains  invariant  if  the  eonstraint  (4-2)  is  substituted 
by  eonstraint  (4-3).  The  proof  of  Lemma  4.2  is  presented  in  Appendix  A. 

The  meaning  of  Lemma  4.2  is  that  we  may  then  separate  the  region  defined  by  (4.3)  into 
compact  and  convex  regions  based  on  j,  i.e.  the  location  of  the  forced  nonzero  element.  Since  the 
optimal  solution  must  reside  in  one  of  these  regions,  we  may  search  for  the  sparsest  null  vector  in 
each  region  from  j  =  1  to  d.  We  subsequently  have  a  convex  formulation  of  li  minimization  for 
each  j.  Algorithm  1  may  then  be  specifically  realized  as  Algorithm  2. 


Algorithm  2  Sparse  Null  Space  Basis  Pursuit 

Initialize:  matrix  A,  N  =  0 

for  i  =  1,  ...  ,  p 

do 

for  j  =  1,  ... 

,  d  do 

Find  nj  = 

=  argmin  |n  |i,  s.t.  An  =  0,  {Pj^±n)j  =  c 

end  for 

n^  =  argmin 

lldllo 

N  =  N  0  n^ 

end  for 
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This  is  tantamount  to  solving  the  following  optimization  problem  for  each  j  in  Algorithm  2, 

min  ||n||i, 

n 

(4.4)  s.t.  An  =  0,  {P^±n)j  =  c. 

It  is  worth  noting  that  the  exact  recovery  of  each  n  via  Eq.  (4.4)  is  determined  by  the  inco¬ 
herence  of  the  linear  operator  defined  by  the  two  constraints  and  the  sparsity  of  each  n.  To  solve 

(4.4) ,  we  adopt  the  framework  of  augmented  Lagrange  method  (ALM)  on  account  of  its  superior 
performance  matrix-norm  minimization  problems  [17]  [18].  Specifically,  we  have  the  augmented 
Lagrange  function  of  (4.4)  as 


L(n,Yi,  Y2,m)  =  ||n||i  +  (Yi,  An)  +  {Y2,(PN-n),-  -  c) 

(4.5)  +|||Anf +  |||(Pi,.n),-cf. 

The  primal  variable  n  is  first  updated  in  each  iteration  with  fixed  dual  variables  Ti,  Y2  and  ja.  By 
introducing  an  auxiliary  variable  7^,  we  have 

(4.6)  U  -  , 

where  T  is  the  soft-thresholding  operator,  and  \\r]\\‘^  >  ||A|p  +  ||Pna|P,  ^and 

(4.7)  =  A^  (ah*  +  Lli 

V  kk  / 

(4.8)  =  Pts!±  ( {Pni-  n)j  -  c  +  ^)  . 

V  J  j 

Next,  the  dual  variables  Yi,  Y2  and  fi  are  updated  as 

(4.9)  Ypi  =Yf +  Aifc(Anfe+i), 

(4.10)  Ypi  =Yf +  Aid(^N-n),-c), 

(4.11)  /i/c+i  =  min{p/r/c,  iimax}- 

The  strategy  of  linearized  ALM  method  provides  a  fast  convergence  rate  [18].  This  effectively 
provides  us  a  method  (Algorithm  2),  named  Sparse  Null  Space  Basis  Pursuit  (SNS-BP)  in  this 
paper,  to  solve  the  SNS  problem  efficiently. 

4.3.  Solving  the  ADL  problem  via  sparse  null  space  basis  pursuit.  In  Section  3,  we 
discussed  the  equivalence  of  the  ADL  problem  and  the  SNS  problem,  and  hence  further  describe 
the  details  of  solving  the  ADL  problem  (as  in  (3.2))  via  SNS-BP. 

For  a  typical  ADL  problem  as  in  Eq.  (3.2),  the  first  step,  as  discussed  in  Section  3,  is  to 
construct  a  matrix  A  whose  transpose  is  the  null  space  of  X.  Concretely,  we  have  the  following 
problem. 

Problem  1.  Find  A  such  that  XA^  =  0. 

^The  value  of  77  is  selected  in  this  way  to  insure  the  convergence  of  the  algorithm.  The  details  are  discussed  in 
[18]. 
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A  simple  way  would  be  to  consider  a  singular  value  decomposition  of  X,  and  keep  the  right 
singular  vectors  with  zero  singular  values  coinciding  with  the  rows  of  A.  For  a  common  scenario 
where  the  data  matrix  X  is  contaminated  by  Gaussian  noise,  we  can  set  A  to  the  right  singular 
vectors  with  small  singular  values,  instead  of  exactly  zero.  This  in  fact  offers  an  additional  advantage 
of  filtering  out  dense  Gaussian  noise  from  the  data  matrix  X. 

Upon  constructing  A,  we  proceed  to  obtain  the  sparse  coefficient  matrix  =  SNS{A).  Note 
that  U  is  computed  independently  of  the  analysis  operator  D.  In  case  D  is  required  for  further 
processing  incoming  data,  using  DX  =  U,  we  may  easily  obtain  D  =  UX’*',  where  X”*"  is  the  pseudo¬ 
inverse  of  X.  In  particular,  if  all  entries  of  the  dataset  are  all  independent,  i.e.  X  is  full  row  rank, 
then  Xt  =  X^(XX^)-h 

While  we  formulated  the  ADL  problem  with  the  constraint  row(U)  =  row(X)  in  Section  3,  if 
a  more  compact  representation  of  X  is  preferred,  we  may  opt  for  row(U)  C  row(X),  hence  further 
reducing  the  dimension  of  the  original  data  space. 

5.  Numerical  experiments  on  SNS  and  ADL.  For  a  quantitative  evaluation  of  our  al¬ 
gorithm,  we  synthesize  data  that  is  compatible  with  the  model  of  SNS  and  ADL,  and  show  that 
SNS-BP  is  able  to  reconstruct  the  sparse  null  space  basis  of  the  SNS  problem/the  sparse  coefficients 
of  the  ADL  problem. 

First,  we  synthesize  a  d  x  g  sparse  matrix  N  as  the  sparse  null  space  basis  of  some  matrix  A, 
where  A  may  be  constructed  by  exploiting  the  SVD  decomposition  of  N,  and  by  selecting  its  left 
singular  vectors  corresponding  to  the  zero  singular  values  as  the  rows  of  A,  i.e.  AN  =  0. 

All  elements  in  N  follow  a  binomial  distribution  as  zero/nonzero  entries.  Moreover,  the  ampli¬ 
tude  of  each  nonzero  element  is  generated  from  a  gaussian  distribution. 

The  matrix  A  can  then  be  seen  as  the  input  to  SNS-BP,  and  we  may  therefore  compare  the 
recovered  nullspace  basis  N  with  the  ground  truth  N.  In  Fig.  1,  we  show  one  example  of  exact 
recovery  of  a  sparse  nullspace  basis  up  to  permutation  and  scale. 


(a)  Randomly  generated  sparse  basis  N 


(b)  Recovered  sparse  nullspace  basis  N 
Fig.  1.  An  example  of  the  result  of  SNS-BP 


In  Fig. 2,  we  present  the  sparsity  level  of  N  with  the  sparsity  of  N  varying  from  0.01  to  0.2,  i.e. 
1%  nonzero  to  20%  nonzero.  If  our  method  works  well,  we  would  expect  it  to  find  the  sparsest  basis, 
and  therefore  p(N)  p(N),  i.e.  the  relative  sparsity  /)(N)//)(N)  1.  In  Fig. 2,  10  experiments  have 

been  carried  out  and  the  average  sparsity  is  calculated.  We  can  see  that  the  sparse  bases  discovered 
by  SNS-BP  have  similar  sparsity  with  N,  with  p(N)  varying  from  0.01  to  0.2.  Additionally,  we 
define  the  relative  error  of  N  as 


err(N)  = 


||NPr-N||^ 

I|n||f 


(5.1) 
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Sparsity 

Fig.  2.  ||N||o/||N||o  vs  Sparsity 


where  P  is  an  arbitrary  permutation  matrix,  and  F  is  a  diagonal  matrix  representing  the  scales  of 
each  sparse  basis.  The  average  relative  error  of  all  the  experiments  with  the  sparsity  of  N  varying 
from  0.01  to  0.2,  is  1.69%. 


Fig.  3.  Sample  synthetic  data  matrix  and  its  underlying  structure 


We  next  test  the  analysis  dictionary  learning  via  SNS-BP  by  exploring  data  samples  with  hidden 
underlying  sparse  structures.  In  particular,  data  samples  are  randomly  selected  from  a  union  of 
low-dimensional  subspaces  S  =  Si  U  S2  -  -  - ,  which  each  subspace  is  also  randomly  constructed  by 
using  the  orthogonal  basis  of  a  set  of  uniformly  distributed  vectors.  Under  this  setting,  each  sample 
can  be  represented  as  a  linear  combination  of  other  samples  in  the  same  subspace.  The  dataset, 
written  as  a  matrix  X  =  [xi,X2,  •  •  •  ,Xn]  as  shown  in  Fig. 3(a),  has  a  sparse  intrinsic  structure  W 
such  that  X  =  XW,  where  VF  is  a  block-diagonal  matrix  as  Fig. 3(b),  and  each  block  represents 
one  subspace.  In  our  experiment,  we  have  data  points  distributed  in  five  3-dimensional  subspaces 
within  the  ambient  space  It  hence  implies  that  the  nullspace  of  A  constructed  from  X  is  of 

dimension  15.  An  analysis  dictionary  D  is  then  trained  using  SNS-BP,  and  the  associated  sparse 
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coefficient  matrix  U  is  obtained  as  shown  in  Fig. 4(a).  Specifically,  we  can  see  that  all  nonzero 
entries  of  each  sparse  vector  are  clustered  together,  corresponding  to  data  samples  from  the  same 
subspace.  In  other  words,  for  an  atom  di  in  the  analysis  dictionary  D,  all  the  data  points  that  have 
a  significant  response  to  it,  are  from  one  subspace,  and  the  rest  of  the  data  have  zero  response. 


(a)  The  sparse  coefficient  matrix  U  by  ADL  using  SNS-BP 


(b)  Permute  rows  of  U  to  show  the  structure  of  X 


Fig.  4.  Sparse  coefficients  using  learned  analysis  dictionary 


We  next  cluster  the  rows  of  U  and  permute  them  accordingly,  as  is  presented  in  Fig. 4(b).  It  is 
interesting  to  see  that  a  more  compact  block-diagonal  structure  emerges  again.  Note  that  in  this 
example  we  find  in  totality  15  atoms,  with  three  atom  sets  each  supporting  the  data  samples  (have 
largely  nonzero  inner  product)  in  a  subspace.  This  number  corresponds  to  the  intrinsic  dimension 
of  each  subspace.  Having  trained  D,  it  is  subsequently  simple  to  figure  out  which  subspace  a  data 
point  belongs  to:  we  may  simply  separate  D  into  D  =  [Di,  D2,  •  •  •  ],  with  being  the  set  of  atoms 
supporting  the  ith  subspace,  and  ultimately  determine  the  maximal  ||Dix||.  It  is  therefore  more 
efficient  to  recover  the  underlying  structure  of  a  given  dataset  and  represent  this  structure  in  a 
more  compact  way. 


5.1.  In  Relation  to  Analysis  K-SVD.  Analysis  K-SVD  arguably  provides  the  state-of- 
the-art  solution  for  the  ADL  problem,  and  has  achieved  a  much  improved  performance  in  signal 
denoising  by  discovering  the  underlying  UoS  structure  of  a  wide  class  of  signals  [25].  We  next 
comparatively  evaluate  the  performance  of  SNS-BP  and  Analysis  K-SVD  in  the  recovery  problem 
of  UoS  of  synthesized  data. 

The  synthesis  of  data  from  a  union  of  subspaces  is  similar  to  the  setting  in  Section  5.  In 
particular,  data  samples  are  randomly  chosen  from  a  union  of  low-dimensional  subspaces  S  = 
Si  U  S2  "  '  ^  in  which  each  subspace  is  also  randomly  constructed  by  using  the  orthogonal  basis 
of  a  set  of  uniformly  distributed  vectors.  In  generating  our  data,  we  pick  an  ambient  space  of 
dimension  100,  and  collectively  5  subspaces.  We  evaluate  the  performance  of  SNS-BP  and  Analysis 
K-SVD  by  varying  the  intrinsic  dimensions  of  each  subspace.  Interestingly,  SNS-BP  shows  constant 
performance  with  the  increase  of  the  intrinsic  dimension  of  each  subspace,  while  the  performance 
of  Analysis  K-SVD  deteriorates  when  the  intrinsic  dimension  of  each  subspace  increases. 
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(a)  The  sparse  coefficient  matrix  U  by  ADL  using  SNS-BP 


(b)  The  sparse  coefficient  matrix  U  by  ADL  using  Analysis  K-SVD 

Fig.  5.  Sparse  coefficients  for  5  sub  spaces  with  intrinsic  dimension  3 


As  shown  in  Figs.  5  and  6,  when  the  intrinsic  dimension  is  low,  both  SNS-BP  and  Analysis 
K-SVD  can  effectively  learn  atoms  from  the  dataset  which  yield  sparse  coefficients  to  efficiently 
reffect  the  underlying  subspace  of  each  data  point.  More  precisely,  each  atom  has  a  strong  response 
on  data  in  only  one  subspace,  and  is  rather  absent  in  other  subspaces.  Five  blocks  corresponding 
to  five  subspaces  appear  in  the  sparse  coefficient  matrix  using  both  approaches.  SNS-BP  exhibits 
a  clearly  improved  performance  over  Analysis  K-SVD,  in  recovering  the  intrinsic  dimension  of  each 
subspace,  as  shown  in  Fig.  5  where  the  intrinsic  dimension  of  each  subspace  is  3,  and  in  Fig.  6  as  the 
dimension  increases  to  5.  This  may  be  understood  by  recalling  that  the  algorithm  SNS-BP  ensures 
that  the  sparse  vector  determined  at  each  iteration  is  orthogonal  to  the  subspace  spanned  by  the 
previously  found  sparse  vectors  (i.e.  truly  novel).  Equivalently  said,  the  rows  in  Fig.  5(a)/Fig.  6(a) 
are  linearly  independent-  a  property  whichh  provides  an  accurate  estimate  of  the  intrinsic  dimension 
of  each  subspace.  In  contrast,  since  Analysis  K-SVD  can  possibly  find  collinear  sparse  vectors,  it 
yield  more  sparse  vectors  than  the  dimension  of  a  given  subspace,  and  may  hence  partially  miss 
another  subspace. 


(b)  The  sparse  coefficient  matrix  U  by  ADL  using  Analysis  K-SVD 


Fig.  6.  Sparse  coefficients  for  5  sub  spaces  with  intrinsic  dimension  5 


Sparsity  and  Nullity:  Paragidms  for  Analysis  Dictionary  Learning 


11 


1 

1 

(a)  The  sparse  coefficient  matrix  U  by  ADL  using  SNS-BP 


(b)  The  sparse  coefficient  matrix  U  by  ADL  using  Analysis  K-SVD 


Fig.  7.  Sparse  coefficients  for  5  sub  spaces  with  intrinsic  dimension  1 


In  Figs.  7,  8,  9  and  10,  the  intrinsic  dimension  of  each  subspace  is  7,  11,  13  and  15,  respectively. 
As  the  intrinsic  dimension  increases.  Analysis  K-SVD  begins  to  learn  atoms  which  span  two  or  more 
subspaces.  Such  cases  become  more  prevalent  as  the  intrinsic  dimension  of  each  subspace  increases. 
In  Fig.  10  for  instance,  the  5  coefficient  blocks  representing  the  5  subspaces  merge  and  display 
significant  responses  from  two  or  more  subspaces.  SNS-BP,  in  contrast,  exhibits  a  more  consistent 
performance  even  as  the  intrinsic  dimension  of  each  subspace  increases,  thus  resulting  in  recovering 
a  correct  dimension  for  each  subspace. 
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(a)  The  sparse  coefficient  matrix  U  by  ADL  using  SNS-BP 


Fig.  8.  Sparse  coefficients  for  5  subspaces  with  intrinsic  dimension  11 
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(a)  The  sparse  coefficient  matrix  U  by  ADL  using  SNS-BP 


(b)  The  sparse  coefficient  matrix  U  by  ADL  using  Analysis  K-SVD 


Fig.  9.  Sparse  coefficients  for  5  sub  spaces  with  intrinsic  dimension  13 
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(a)  The  sparse  coefficient  matrix  U  by  ADL  using  SNS-BP 


(b)  The  sparse  coefficient  matrix  U  by  ADL  using  Analysis  K-SVD 


Fig.  10.  Sparse  coefficients  for  5  subspaces  with  intrinsic  dimension  15 


In  Fig.  11,  we  show  a  case  of  UoS  with  different  randomly  generated  dimensions.  It  worth 
noting  that  given  the  ambient  space  of  i?^00,  we  limit  the  intrinsic  dimension  of  each  subspace 
between  1  and  19,  to  avoid  the  extreme  case  that  when  every  subspace  is  of  dimension  20,  then  we 
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may  run  into  the  situation  that  all  samples  are  randomly  chosen  from  i?^00  and  lose  any  non-trivial 
UoS  structure.  In  this  case,  Analysis  K-SVD  misses  one  subspace  with  the  lowest  dimension  entirely 
after  learneding  45  atoms,  and  a  non- negligible  portion  of  the  dictionary  fails  to  distinguish  data 
from  different  subspaces.  SNS-BP  shows  a  clear  5-subspace  of  the  same  dataset. 


Fig.  11.  Sparse  coefficients  for  5  subspaces  with  intrinsic  dimension  randomly  picked  between  1  and  19  as  9, 
8,  3,  12,  13 


We  further  thoroughly  test  the  performance  of  SNS-BP  and  Analysis  K-SVD  on  a  dataset 
with  intrinsic  dimension  from  5  to  16  using  the  measure  of  intra-cluster  covariance  ratio.  For  each 
dimension,  we  repeat  the  process  of  data  generation  20  times,  to  obtain  the  mean  value  of  the 
intra-cluster  correlation  ratios  for  both  methods. 

We  define  the  intra-cluster  covariance  ratio  as 


(5.2) 


CV{\5)  = 


Intra-cluster  covariance  of  U 
total  covariance  of  U 


In  Fig.  12,  we  can  see  that  the  CV  of  Analysis  K-SVD  deteriorates  from  1  to  around  0.75  with 
the  increase  of  the  intrinsic  dimension  of  each  subspace.  While  SNS-BP  maintains  a  consistently 
high  performance  throughout. 
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Fig.  12.  Intra- cluster  covariance  ratio  for  SNS-BP  and  Analysis  K-SVD.  The  x-axis  is  the  intrinsic  dimension 
of  each  subspace,  and  the  y-axis  is  the  CV  value. 

We  further  test  the  performance  of  SNS-BP  and  Analysis  K-SVD  by  randomly  choosing  the 
intrinsic  dimension  of  each  subspace  between  1  and  19.  In  this  experiment,  the  intrinsic  dimension 
of  each  subspace  can  be  different.  We  repeat  the  data  generation  100  times  to  obtain  mean  CV 
values  for  both  methods.  The  results  show  a  higher  CV  value  1.00  of  SNS-BP  versus  0.97  of  Analysis 
K-SVD. 

All  these  experiments  in  this  part  prove  a  more  robust  performance  of  SNS-BP  compared 
to  Analysis  K-SVD  when  the  intrinsic  dimension  of  each  subspace  is  higher.  Moreover,  SNS-BP 
demonstrates  an  ability  of  recovering  the  intrinsic  dimensionality  of  each  underlying  subspace,  and 
of  automatically  avoiding  redundant  atoms. 

5.2.  Applications  on  real-world  data.  In  this  experimental  section,  we  explore  the  infer¬ 
ence  potential  of  our  method  on  images.  The  performance  of  our  algorithm  is  evaluated  on  texture 
images  from  the  Brodatz  database  [22].  Each  texture  image  is  partitioned  into  a  set  of  patches, 
and  their  analysis  operator  is  learned  from  patches  of  different  textures.  The  latter  is  subsequently 
applied  to  incoming  data,  which  is  also  segmented  into  sets  of  patches.  The  properties  of  various 
textures  may  lead  to  different  patterns  of  the  corresponding  sparse  coefficients.  For  example,  the 
texture  with  more  randomness  may  lead  to  less  sparsely  structured  coefficients  and  more  coher¬ 
ent/correlated  textures,  i.e.  on  account  of  a  broadly  spread  distribution  of  its  patches,  discovered 
upon  applying  the  learned  operator  to  incoming  data.  Note  that  we  avoid  the  complexity  of  pro¬ 
cessing  the  order  of  patches  by  considering  the  distribution  of  the  coefficients  instead  of  matching 
the  output  vectors. 

Specifically,  we  segment  each  texture  image  into  10  x  10  patches,  and  randomly  pick  a  subset 
of  120  patches  as  the  training  set  from  each  texture  image,  and  the  rest  of  the  patches  are  used  as 
a  testing  set.  The  texture  images  from  the  Brodatz  database  [22],  and  the  corresponding  sample 
patches  are  shown  in  Fig.  13.  In  our  experiment,  we  first  train  the  analysis  operator  by  using 
half  of  the  data  in  the  training  set  without  knowing  the  label  of  each  patch,  and  then  calculate 
the  distribution  of  the  coefficients  Pi  of  the  rest  of  the  patches  from  the  class  of  texture  in  the 
training  set.  In  the  next  testing  stage,  texture  images  are  used  as  a  set  of  patches,  for  which  the 
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distribution  of  the  coefficients  Uj  =  DXj  with  all  Pi  are  compated.  In  particular,  we  assign  Xj  to 
the  class  with  the  closest  distribution,  such  as 

(5.3)  classCKj)  =  aig min d{Pi,  Pij.). 

We  use  the  total  variation  distance  in  ((5.3),  as  defined  in  [16], 

(5.4)  d{p,q)  =  \\p-q\\TV  =  ^^2  Wd 
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Fig.  13.  Textures  and  the  eorresponding  patehes  for  training 

In  this  experiment,  the  classification  rate  is  97.78%  for  the  texture  images  shown  in  Fig.  13. 
The  performance  is  higher  than  known  state  of  the  art  methods  based  on  predesigned  features, 
such  as  [26]  with  a  86.63%  classification  rate,  and  comparable  to  the  supervised  dictionary  learning 
algorithm  [21].  It  isimportant  to  note  that,  the  training  set  in  our  case,  is  only  composed  of  around 
1%  of  the  dataset.  A  less  stringent  training  set  implies  a  lower  computational  cost.  This  also 
demonstrates  the  scalability  of  our  method,  in  light  of  its  competitive  classification  performance. 

6.  Conclusion.  We  have  proposed  in  this  paper,  a  novel  approach  for  the  sparse  nullspace 
problem,  and  have  unveiled  its  equivalence  to  the  ADL  problem.  We  have  presented  the  SNS-BP, 
an  iterative  algorithm  based  on  li  minimization,  to  pursue  the  solution  of  the  SNS  problem.  We 
have  further  applied  this  algorithm  to  analysis  dictionary  learning,  and  show  the  efficacy  of  our 
approach  by  experiments  on  both  synthetic  dataset  and  real-world  data  in  texture  classification. 

Future  work  may  include  several  aspects  related  to  both  SNS  problem  and  ADL.  The  relation 
between  SNS  and  nonlinear  dimension  reduction  needs  further  investigation  and  may  lead  to  results 
on  graph  embedding.  Moreover,  our  future  goal  is  to  explore  the  potential  application  of  ADL  on 
other  high  dimensional  database,  such  as  image/video  classification. 

Appendix.  Proof  of  Lemma  4.2. 

Lemma  4.2  states  that  the  following  optimization  problems  are  equivalent. 
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min||n||o, 

n 

(A)  s.t.  An  =  0,  -Pn^  7^  0) 


min||n||o, 

n 

(B)  s.t.  An  =  0,  3j  G  {1, . . . ,  d},  {P]y±n)j  =  c. 

Proof.  First,  we  show  that  if  n  is  an  optimal  solution  of  (A),  then  for  some  real  number  a,  an 
is  also  a  minimizer  of  (B). 

Any  minimizer  n'  of  (B),  also  satisfies  the  constraints  of  (A).  It  hence  follows  that 
(A.l)  l|n'||o>||n||o. 


Assume  |(PNAn)/c|  =  ||PNAn||oo,  and  noting  that  ||PNAn||oo  7^  0,  we  construct 


(A.2) 


c 

{Pj^±n)k 


n  =  an. 


Since  {Pf^±n)k  =  c,  h  is  also  feasible  in  (B).  Suppose  that  ||h||o  =  ||n||o  <  ||n'||o,  and  in  combination 
of  Eq.  A.l,  we  can  conclude  that  h  =  an,  and  is  also  a  solution  of  (B),  and  therefore  ||h||o  =  ||n'||o. 

Then  it  is  trivial  to  show  that  n'  is  also  a  minimizer  of  (A),  given  the  fact  that  n'  is  a  feasible 
solution  of  (A)  and  ||n'||o  =  ||h||o  =  ||n||o,  thus  proving  Lemma  4.2.  □ 
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